
Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Rita, Mathieu, Strub, Florian, Chaabouni, Rahma, Michel, Paul, Dupoux, Emmanuel, Pietquin, Olivier

arXiv.org Artificial Intelligence

While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We demonstrate the effectiveness of RCfD on three language tasks, where it achieves comparable performance to carefully tuned baselines while mitigating ROO.
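The objective shift the abstract describes can be made concrete with a minimal sketch. This is an illustration under assumptions, not the paper's implementation: `reward_model` and `toy_rm` are hypothetical placeholders, and a real setup would feed this shaped reward into an RL loop (e.g. PPO) over sampled completions.

```python
def rcfd_reward(reward_model, prompt, policy_sample, demonstration):
    """RCfD-style reward shaping (sketch): the RL reward becomes the
    negative distance between the policy sample's reward and the
    demonstration's reward under the same reward model, instead of
    the raw reward itself."""
    r_policy = reward_model(prompt, policy_sample)
    r_demo = reward_model(prompt, demonstration)
    # Matching the demonstration's reward level removes the incentive
    # to push the reward model into over-optimized (exploited) regions.
    return -abs(r_policy - r_demo)

# Hypothetical toy reward model that simply scores response length,
# standing in for a learned scalar reward model.
toy_rm = lambda prompt, text: float(len(text))

shaped = rcfd_reward(toy_rm, "Q?", "short answer", "a reference answer")
```

Because the maximum shaped reward (zero) is reached at the demonstration's reward level, the policy has no incentive to drive the raw reward arbitrarily high.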


Separability, Contextuality, and the Quantum Frame Problem

Fields, Chris, Glazebrook, James F.

arXiv.org Artificial Intelligence

We study the relationship between assumptions of state separability and both preparation and measurement contextuality, and the relationship of both of these to the frame problem, the problem of predicting what does not change in consequence of an action. We state a quantum analog of the latter and prove its undecidability. We show how contextuality is generically induced in state preparation and measurement by basis choice, thermodynamic exchange, and the imposition of a priori causal models, and how fine-tuning assumptions appear ubiquitously in settings characterized as non-contextual.


Gravilon: Applications of a New Gradient Descent Method to Machine Learning

Kelterborn, Chad, Mazur, Marcin, Petrenko, Bogdan V.

arXiv.org Machine Learning

Gradient descent algorithms have been used in countless applications since the inception of Newton's method. The explosion in the number of applications of neural networks has re-energized efforts in recent years to improve the standard gradient descent method in both efficiency and accuracy. These methods modify the effect of the gradient in updating the values of the parameters, and the modifications often incorporate hyperparameters: additional variables whose values must be specified at the outset of the program. We introduce a novel gradient descent algorithm, called Gravilon, that uses the geometry of the hypersurface to modify the length of the step in the direction of the gradient. Using neural networks, we provide promising experimental results comparing the accuracy and efficiency of the Gravilon method against commonly used gradient descent algorithms on MNIST digit classification.
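The abstract's idea of letting the hypersurface's geometry set the step length can be sketched as follows. This is an assumed illustration of a geometry-scaled update, not Gravilon's actual rule (which the abstract does not specify): here the step along the gradient shrinks where the loss surface is steep, using the slope-dependent factor 1/sqrt(1 + ||grad||^2).

```python
import math

def geometry_scaled_step(grad, base_step=0.1):
    """Illustrative geometry-aware descent step (assumption, not the
    paper's exact method): scale the step along the gradient direction
    by the local slope of the loss hypersurface, so steep regions take
    shorter steps and flat regions take longer ones."""
    norm = math.sqrt(sum(g * g for g in grad))
    # 1/sqrt(1 + ||grad||^2) is the cosine of the hypersurface's
    # inclination angle, a purely geometric quantity.
    scale = base_step / math.sqrt(1.0 + norm * norm)
    return [-scale * g for g in grad]

# One descent step on f(x, y) = x^2 + y^2 at (3, 4); the gradient there
# is (6, 8), with norm 10, so the step is damped by 1/sqrt(101).
step = geometry_scaled_step([6.0, 8.0])
```

Note that, unlike a learning-rate schedule, the damping factor here is computed from the surface itself rather than supplied as an extra tuned hyperparameter, which is the motivation the abstract emphasizes.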